Refactor little endian binary decoding into a shared internal utility#182
Merged
mchav merged 2 commits intoDataHaskell:mainfrom Mar 15, 2026
Merged
Conversation
… module and update Parquet and Lazy binary readers to use the same implementation so byte decoding behavior is consistent across code paths and easier to maintain; this removes duplicated decode logic, keeps existing APIs intact through Parquet.Binary wrappers, adds and preserves endian regression coverage, and confirms compatibility by running formatting, lint, full build, and the complete test suite successfully.
mchav
requested changes
Mar 12, 2026
src/DataFrame/IO/Parquet/Binary.hs
Outdated
| ) | ||
| | otherwise = | ||
| littleEndianWord32 (BS.take 4 $ bytes `BS.append` BS.pack [0, 0, 0, 0]) | ||
| littleEndianWord32 = Binary.littleEndianWord32 |
Member
There was a problem hiding this comment.
We don't need to keep this re export.
| littleEndianWord32 bytes | ||
| | len >= 4 = | ||
| assembleWord32 | ||
| (BS.index bytes 0) |
src/DataFrame/Internal/Binary.hs
Outdated
|
|
||
| byteAtOrZero :: Int -> BS.ByteString -> Int -> Word8 | ||
| byteAtOrZero len bytes i | ||
| | i < len = BS.index bytes i |
…y and remove redundant re exports; update parquet modules and tests to import DataFrame.Internal.Binary directly, and fix byteAtOrZero bounds check to avoid unsafe indexing on negative offsets while keeping guarded fast path indexing unchanged
mchav
approved these changes
Mar 15, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR addresses endian decoding consistency and performance by replacing loop-based byte-to-word decoding with unrolled little-endian helpers, then moving those helpers into a
shared internal binary utility used by both Parquet and Lazy binary paths. This keeps behavior aligned across code paths, removes duplicate implementations, and makes future binary
changes safer because there is one canonical implementation.